86 research outputs found

Unravelling Interlanguage Facts via Explainable Machine Learning

Authors: Berti Barbara, Esuli Andrea, Sebastiani Fabrizio
Publication date: 02/08/2022

Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e., that of analysing the internals of an NLI classifier trained by an explainable machine learning algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena "give a speaker's native language away". We use this perspective in order to tackle both NLI and a (much less researched) companion task, i.e., guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, that is, which are most indicative of a speaker's L1. We also present two case studies, one on Spanish and one on Italian learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s. Overall, our study shows that the use of explainable machine learning can be a valuable tool for t

arXiv.org e-Print Archive

Automatic Generation of Lexical Resources for Opinion Mining: Models, Algorithms and Applications

Author: Esuli Andrea
Publication venue: Pisa University Press
Publication date: 11/04/2008

Opinion mining is a recent discipline at the crossroads of Information Retrieval and Computational Linguistics which is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users' opinions about products or about political candidates as expressed in online forums, to customer relationship management. Functional to the extraction of opinions from text is the determination of the relevant entities of the language that are used to express opinions, and of their opinion-related properties; an example is determining that the term beautiful casts a positive connotation on its subject. In this thesis we investigate the automatic recognition of opinion-related properties of terms. This results in building opinion-related lexical resources, which can be used in opinion mining applications. We start from the (relatively) simple problem of determining the orientation of subjective terms. We propose an original semi-supervised term classification model that is based on the quantitative analysis of the glosses of such terms, i.e., the definitions that these terms are given in on-line dictionaries. This method outperforms all known methods when tested on the recognized standard benchmarks for this task. We show how our method is capable of producing good results on more complex tasks, such as discriminating subjective terms (e.g., good) from objective ones (e.g., green), or classifying terms on a fine-grained attitude taxonomy. We then propose a relevant refinement of the task, i.e., distinguishing the opinion-related properties of distinct term senses. We present SentiWordNet, a novel high-quality, high-coverage lexical resource, where each one of the 115,424 senses contained in WordNet has been automatically evaluated on the three dimensions of positivity, negativity, and objectivity.
We also propose an original and effective use of random-walk models to rank term senses by their positivity or negativity. The random-walk algorithms we present also have great application potential outside the opinion mining area, for example in word sense disambiguation tasks. A result of this experience is the generation of an improved version of SentiWordNet. We finally evaluate and compare the various versions of SentiWordNet we present here with other opinion-related lexical resources well known in the literature, experimenting with their use in an Opinion Extraction application. We show that the use of SentiWordNet produces a significant improvement with respect to the baseline system, which does not use any specialized lexical resource, and also with respect to the use of other opinion-related lexical resources.

Electronic Thesis and Dissertation Archive - Università di Pisa